6 research outputs found
Sequence to Sequence Neural Speech Synthesis with Prosody Modification Capabilities
Modern sequence to sequence neural TTS systems provide close to natural
speech quality. Such systems usually comprise a network converting
a linguistic/phonetic feature sequence to an acoustic feature sequence,
cascaded with a neural vocoder. The generated speech prosody (i.e. phoneme
durations, pitch and loudness) is implicitly present in the acoustic features,
being mixed with spectral information. Although the speech sounds natural, its
prosody realization is randomly chosen and cannot be easily altered. The
prosody control becomes an even more difficult task if no prosodic labeling is
present in the training data. Recently, much progress has been achieved in
unsupervised speaking style learning and generation; however, human inspection
is still required after the training for discovery and interpretation of the
speaking styles learned by the system. In this work we introduce a fully
automatic method that makes the system aware of the prosody and enables
sentence-wise speaking pace and expressiveness control on a continuous scale.
While being useful by itself in many applications, the proposed prosody control
can also improve the overall quality and expressiveness of the synthesized
speech, as demonstrated by subjective listening evaluations. We also propose a
novel augmented attention mechanism that facilitates better pace control
sensitivity and faster attention convergence.
Comment: published at 10th ISCA Speech Synthesis Workshop (SSW-10, 2019)
A Neural TTS System with Parallel Prosody Transfer from Unseen Speakers
Modern neural TTS systems are capable of generating natural and expressive
speech when provided with sufficient amounts of training data. Such systems can
be equipped with prosody-control functionality, allowing for more direct
shaping of the speech output at inference time. In some TTS applications, it
may be desirable to have an option that guides the TTS system with an ad-hoc
speech recording exemplar to impose an implicit fine-grained, user-preferred
prosodic realization for certain input prompts. In this work we present a
first-of-its-kind neural TTS system equipped with such functionality to
transfer the prosody from a parallel text recording from an unseen speaker. We
demonstrate that the proposed system can precisely transfer the speech prosody
from novel speakers to various trained TTS voices with no quality degradation,
while preserving the target TTS speakers' identity, as evaluated by a set of
subjective listening experiments.
Comment: Presented at Interspeech 202
Speak While You Think: Streaming Speech Synthesis During Text Generation
Large Language Models (LLMs) demonstrate impressive capabilities, yet
interaction with these models is mostly facilitated through text. Using
Text-To-Speech to synthesize LLM outputs typically results in notable latency,
which is impractical for fluent voice conversations. We propose LLM2Speech, an
architecture to synthesize speech while text is being generated by an LLM, which
yields significant latency reduction. LLM2Speech mimics the predictions of a
non-streaming teacher model while limiting the exposure to future context in
order to enable streaming. It exploits the hidden embeddings of the LLM, a
by-product of the text generation that contains informative semantic context.
Experimental results show that LLM2Speech maintains the teacher's quality while
reducing the latency to enable natural conversations.
Comment: Under review for ICASSP 202